Two-dimensional materials-based probabilistic synapses and reconfigurable neurons for measuring inference uncertainty using Bayesian neural networks

Artificial neural networks have demonstrated superiority over traditional computing architectures in tasks such as pattern classification and learning. However, they do not measure uncertainty in predictions, and hence they can make wrong predictions with high confidence, which can be detrimental for many mission-critical applications. In contrast, Bayesian neural networks (BNNs) naturally include such uncertainty in their model, as the weights are represented by probability distributions (e.g. Gaussian distribution). Here we introduce three-terminal memtransistors based on two-dimensional (2D) materials, which can emulate both probabilistic synapses as well as reconfigurable neurons. The cycle-to-cycle variation in the programming of the 2D memtransistor is exploited to achieve Gaussian random number generator-based synapses, whereas 2D memtransistor based integrated circuits are used to obtain neurons with hyperbolic tangent and sigmoid activation functions. Finally, memtransistor-based synapses and neurons are combined in a crossbar array architecture to realize a BNN accelerator for a data classification task.

1. The authors spend a significant amount of time comparing the devices to two terminal memristors, and listing the advantages in that respect. However, the MoS2 transistor as described is a three terminal charge trapping memory, and hence not particularly different from a standard three terminal charge trapping memory. While it is impressive that the authors have developed a CTM devices that incorporates MoS2, it appears the circuits would all be possible to implement in a standard, commercial CTM memory, such as TANOS, SONOS, etc. The random distributions also seem likely an effect of charge trapping, which would be possible in a standard CTM. Please elaborate on whether this is the case, or there is something significantly different about the physics of the MoS2 that cannot be replicated in a standard CTM memory.
2. The "tanh circuit" created by using T1 as a resistive load is a common drain amplifier, which is a standard transistor amplifier circuit with a well-known transfer function. I would recommend acknowledging this because it seems a bit like this is being presented as a novel circuit (which would seem odd to the microelectronic circuits community). I have also seen similar common drain circuit implemented with floating gate devices, giving similar VTC behavior, such as in Fig 5 of the referenced paper below.
3. For the LTSpice simulations, can you elaborate on the tanh function implementation; is this just modeled with a single resistor and NMOS? If the VTC looks reasonable (i.e. Fig 4h) then why is this a major source of inaccuracy? If this circuit is so inaccurate, why would it be used rather than the referenced previously used NMOS/PMOS version in ref 39? 4. A minor, optional note on the amplifier circuits, the more common convention is to have the high voltage on the top of the circuit such that current flows down. (It is opposite of this in Fig 4(g) and Supplemental Fig 7(a). 5. How was device to device variation modeled in LTSpice? Is there an experimental link between physical nonidealities and the modeling factors should be presented? Also, do the authors know the programming error of the devices, i.e. the delta between the expected and actual conductance after programming? 6. What is the gate voltage (or gate voltage range) of the synapses during read? What operating region is the transistor in at this gate voltage? It sounds like these are modulated, but in LTSpice, if you are treating the two device pair as a resistor, I am assuming the gate to source and drain to source voltages are such that both devices are operating in the linear part of the triode region, such that a resistor is a suitable element to represent the pair. However this is unclear from the description. 7. The authors should model something larger than the IRIS dataset, which is quite far from a dataset and network that would be used in a real world application. It has been observed that have seen that excellent accuracy on small datasets does not translate to high accuracy on larger datasets. There is little if any practical application for using a custom, efficient accelerator to process a dataset as small as IRIS. I would understand the very small dataset if this was an experimental demonstration, but given that it is a simulation only, the authors should include something larger.
output (once per sample). Having to reprogram the crossbar numerous times to perform a single inference seems an enormous energy cost. Due to this concern, the paper should include an energy analysis with some benchmarks, in my opinion.
These multiple programming operations also raise the issue of device endurance, which should be discussed.
Also, if the synapses need to be reprogrammed each sample, we need additional memory arrays storing the mean value and standard deviations of each weight. This is an important cost, and this would also limit the energy efficiency of the authors' approach, as important data movement will be involved.
A significant concern is that Fig 5 shows no Bayesian result: we only see the final accuracy, but no distributions.
Also, why are Bayesian neural networks useful for this example? I think that conventional neural networks get excellent accuracy on this task.
The degradation in accuracy between software and LTSpice simulation is quite severe, even for a very simple task, due to the neuron behavior. This is an important limitation. Can this problem be fixed?
Fig 5g with 0% variation shows a test accuracy that is practically 100%. However, in the text, the accuracy is said to be 93.78 %. This is a major concern.
Does Fig 5g include the effects of variability of the neuron devices? I understand that it does not, and that would be a problem, as this variability might have worse effects than the one of synapses. The accuracy is assessed using LTSpice on the IRIS dataset and with certain configurations is found to be reasonably high.
The topic of novel devices to implement BNNs is of interest to the community, as is the quantification of uncertainty in neural inference. This paper is very well written and well organized. However, before considering for publication in Nature Communications, there are some key issues which should be considered to clarify the novelty and significance of the results.
We are happy to know that the reviewer finds our manuscript is well written and is of interest to the community. A point-by-point response to the comments raised by the reviewer can be found below.
1. The authors spend a significant amount of time comparing the devices to two terminal memristors, and listing the advantages in that respect. However, the MoS2 transistor as described is a three terminal charge trapping memory, and hence not particularly different from a standard three terminal charge trapping memory. While it is impressive that the authors have developed a CTM devices that incorporates MoS2, it appears the circuits would all be possible to implement in a standard, commercial CTM memory, such as TANOS, SONOS, etc. The random distributions also seem likely an effect of charge trapping, which would be possible in a standard CTM. Please elaborate on whether this is the case, or there is something significantly different about the physics of the MoS2 that cannot be replicated in a standard CTM memory.
The random distributions are observed as an effect of random nature of charge trapping which is typically observed in charge-trap memory (CTM) devices [4,5]. Hence, our implementation can be easily translated to other CTM devices. Our MoS2 memtransistors offer an alternative to other CTM, such as TANOS and SONOS. The atomically thin two-dimensional (2D) semiconductors allow geometric miniaturization of field-effect transistors (FETs) without any loss of electrostatic integrity. The scalability of FETs is captured through the screening length ( SC ), shown in Eq. R1, which represents the competition between gate and drain potential for control of the channel charge.
Here, s and ox are the thicknesses and s and ox are the dielectric constants of the semiconducting channel and the insulating oxide, respectively. To avoid short channel effects the channel length of an FET ( CH ) has to be at least three times higher than the screening length, i.e.
Hence, we have utilized MoS2 memtransistors for the implementation of BNN.
We have clarified that the circuits can also be implemented with standard flash memories in the main manuscript.
2. The "tanh circuit" created by using T1 as a resistive load is a common drain amplifier, which is a standard transistor amplifier circuit with a well-known transfer function. I would recommend acknowledging this because it seems a bit like this is being presented as a novel circuit (which would seem odd to the microelectronic circuits community). I have also seen similar common drain circuit implemented with floating gate devices, giving similar VTC behavior, such as in Fig 5 of the referenced paper below.
We agree with the reviewer that this is a common-drain amplifier circuit. While the output characteristics of the common-drain amplifier circuit is well-known [11], to the best of our knowledge, it has not been used in the context of neuromorphic computing in order to implement the tanh/sigmoid activation function. As demonstrated in the manuscript, this simple and energy efficient circuit can be used to implement the tanh activation function which is otherwise implemented using look-up-tables (LUT) consisting of numerous transistors [3].
We have clarified that the output characteristics is well-known and we have added the reference suggested by the reviewer in the main manuscript.

For the LTSpice simulations, can you elaborate on the tanh function implementation; is
this just modeled with a single resistor and NMOS? If the VTC looks reasonable (i.e. Fig 4h) then why is this a major source of inaccuracy? If this circuit is so inaccurate, why would it be used rather than the referenced previously used NMOS/PMOS version in ref 39?
We thank the reviewer for raising this question. As the reviewer rightly points out, the tanh activation function is modelled using a resistor and an n-type FET, which results in an asymmetry in the transfer function. We agree that an implementation using the integration of an n-type FET and a p-type FET would result in better accuracy. Hence, we demonstrate a circuit for a modified tanh (m-tanh) activation function using a n-type MoS2 memtransistor (T1) and a V-doped p-type Here, S is applied to the gate of T1 and T2. T1 and T2 are programmed to ensure that the m-tanh function passes through the origin. Note that when S = -2 V, T1 operates in the off-state and T2 and becomes more conductive than T2, which results in O = DD . While this output characteristics is well-known [3,12], to the best of our knowledge it has not been used to implement activation functions, as the activation functions are typically implemented using look-up-tables [3].
Additionally, modified sigmoid activation function can be realized by applying 0 V to the drain terminal of T1, as shown in Fig. R1c and Fig. R1d.
We have revised the implementation of the activation function and added this discussion in the main manuscript. We thank the reviewer for this suggestion. We have changed the figures to follow this common convention. We are happy to provide further clarification. The synapses consisting of 2 memtransistors, T + and T − are modelled using resistors, as shown in Fig. R2a. Here, T + is modelled as a resistor which draws from Gaussian conductance distribution given by +~� + , + � and T − is modelled as a resistor set to a constant conductance of − . This results in the synapse having an effective conductance distribution given by ~� eff , eff �, where, eff = + -− and eff = + , enabling independent control of the mean and standard deviation of . 5 While the cycle-to-cycle variation in programming is useful for the generation of Gaussian random numbers, the device-to-device variation is undesirable. Fig. R2b demonstrates the device-to-device variation across 40 MoS2 memtransistors while performing program/erase operation. Here, the progressively higher programming voltage pulses ( P ) followed by progressively higher erase voltage pulses ( E ) are applied to the MoS2 memtransistors and the drain-to-source current ( DS )

How was device to device variation modeled in LTSpice
is measured at a read voltage ( R ) of 0 V. This is repeated for 2 cycles, to demonstrate cycle-tocycle variation. While the cycle-to-cycle variation is beneficial for a BNN, device-to-device variation is detrimental to its operation.  Here, is the scaling factor and is the exponential factor. If we account for this dependence, the device-to-device variation can be reduced. Δ DS / DS,start follows a Gaussian distribution, as shown representatively in Fig. R2e for the erase operation.
In order to evaluate the effect of device-to-device variation, we implement up to 10 % variation in the parameters + , + , and − , which results in a corresponding variation in . In line with the above discussion, they are drawn from Gaussian distributions where the mean is given by expected + , + , and − and standard deviation of up to 10 % is considered. In the revised manuscript, we have also modelled the effect of device-to-device variation in the neurons. For this, we assume up to 10 % variation in the threshold voltage for the n-type FET ( TH, ) and the p-type FET ( TH, ). We have also implemented up to 10 % variation in resistance of the sense resistor.
We have included this discussion in the supplementary information and main manuscript.

What is the gate voltage (or gate voltage range) of the synapses during read? What
operating region is the transistor in at this gate voltage? It sounds like these are modulated, but in LTSpice, if you are treating the two-device pair as a resistor, I am assuming the gate to source and drain to source voltages are such that both devices are operating in the linear part of the triode region, such that a resistor is a suitable element to represent the pair.
However, this is unclear from the description.
We are happy to provide further clarification. A gate-to-source voltage ( GS ) or read voltage ( R ) of 0 V is used during the read step. R of 0 is used in order to avoid read-disturb caused by a positive or negative GS . As the reviewer rightly pointed out, both the GS and drain-to-source voltage ( DS ) are used such that the transistors are operating in the linear part of the triode region.
Hence, they are modelled using resistors.
We have added this clarification in the main manuscript.

The authors should model something larger than the IRIS dataset, which is quite far from a dataset and network that would be used in a real-world application. It has been observed
that have seen that excellent accuracy on small datasets does not translate to high accuracy on larger datasets. There is little if any practical application for using a custom, efficient accelerator to process a dataset as small as IRIS. I would understand the very small dataset if this was an experimental demonstration, but given that it is a simulation only, the authors should include something larger.
We appreciate the suggestion by the reviewer. We have now implemented a BNN classifier to classify the PIMA Indian diabetes dataset. This dataset consists of nine parameters such as number of pregnancies, glucose levels, insulin levels, body mass index, age, etc. To classify this dataset, we use a fully connected 8×10×2 BNN i.e., it has an input layer with 8 neurons, one hidden layer with 10 neurons, and an output layer with 2 neurons. The dataset with 767 instances is divided into 720 for training and 47 for testing. Bayes by Backprop algorithm, with a Gaussian prior is used to train the synaptic weight distributions [13,14]. The BNN is trained off-chip for 300 epochs as shown in Fig. R3a, to obtain train accuracy of 75.41 % and test accuracy of 80.85 %. Similar accuracy numbers have been reported in prior works [15][16][17].
Following Eq. R3, the output of the BNN accelerator is sampled =100 times to obtain a predictive distribution, and its mean is used to make the classification.
Here, * and * are the test input and output, respectively, D is the training data, and represents the th Monte Carlo weight sample. The softmax of predictive distribution can be used to calculate the uncertainty in classification or entropy given by Eq. R4. [18,19].
Here, � * is the softmax output and is the number of output classes. The entropy can be decomposed into epistemic entropy i.e., uncertainty in model and aleatoric entropy i.e., uncertainty in data as shown in Eq. R5.
Here, on the right-hand side, the first term represents epistemic entropy and the second term represents aleatoric entropy. Aleatoric entropy is the average entropy for fixed weights and hence the uncertainty arises from the data. Epistemic entropy can be obtained by subtracting aleatoric entropy from total entropy, following Eq. R5.
The BNN accelerator to classify PIMA Indian diabetes dataset is evaluated using LTSpice simulations. The synapses consisting of 2 memtransistors, T + and T − are modelled using resistors.
Here, T + is modelled as a resistor which draws from Gaussian conductance distribution given by +~� + , + � and T − is modelled as a resistor set to a constant conductance of − . The sense transistor is modelled using a resistor. The tanh activation function is implemented using the combination of an n-type FET and a p-type FET. Using the BNN accelerator, we are able replicate the test accuracy of 80.85 %. In order to evaluate the effect of device-to-device variation in memtransistors, we implement up to 10 % variation for the parameters of the synapse: + , + , and − , which results in a corresponding variation in , the parameters of the tanh function: TH, and TH, , and for the n-type FET and the p-type FET, and the conductance of the sense resistor S . These parameters are drawn from Gaussian distributions where the mean is given by

) Accuracy and predicative accuracy as function of model variation. Here, the effect of variation in synaptic devices, sense resistors, and activation function is demonstrated. c) Total entropy, aleatoric entropy and epistemic entropy as a function of model variation. d) Accuracy and predictive accuracy as a function of input variation. e) Total entropy, aleatoric entropy and epistemic entropy as a function of input variation.
9 expected parameter value and standard deviation of up to 10 % is considered. Fig. R3b shows the effect of device-to-device (model) variation on the test accuracy and predictive accuracy. The predictive accuracy demonstrates how well the predictions of the expected correct classes are made. Here, the BNN is simulated and averaged over 5 runs. While, we observe a decrease in the test accuracy, it is not seen to significantly impact the operation of the BNN and an accuracy of ≈ 60 % is maintained for 10 % variation. In a BNN, we can use entropy estimation and entropy decomposition to quantify uncertainty and to find its source, respectively. Fig. R3c shows the total entropy, aleatoric entropy, and epistemic entropy as a function of model variation, calculated using Eq. R4 and Eq. R5. Here, we would expect the aleatoric uncertainty to remain unchanged and total entropy to increase. However, their extraction is impacted by the increased model variation.
Nevertheless, as expected the epistemic entropy increases as the model variation increases. As shown in Fig. R3d and Fig. R3e, an increase in the input variation results in the degradation of accuracy, along with an increase in the total and aleatoric entropy, while a constant epistemic entropy is maintained, as expected. Hence, in addition to the estimation of total entropy using a BNN, with uncertainty decomposition various sources of entropy can be identified.
We have changed the classification dataset and added this discussion in the main manuscript.
8. Although the title conveys an interesting prospect, it does not seem well connected to the content and I would suggest modifying it. The work aims to quantify uncertainty, but does not "avoid inference inaccuracy".
In line with the reviewer's suggestion, we have changed the title to "A Bayesian Neural Network to Quantify Inference Inaccuracy". We are happy to know that the reviewer appreciates the idea of exploiting stochasticity in devices to implement probabilistic computing demonstrated in the paper. A point-by-point response to the comments raised by the reviewer can be found below. We are happy to add further references supporting our statement. Various memristors, transistors and other devices based on 2D materials have been explored for neuromorphic computing applications [20][21][22]. 2D materials have also been explored for optoelectronic synapses enabled by their optically active monolayers [23,24]. Also, note that the atomically thin 2D semiconductors allow geometric miniaturization of FETs without any loss of electrostatic integrity. The scalability of FETs is captured through SC , shown in Eq. R6, which represents the competition between gate and drain potential for control of the channel charge.
Here, s and ox are the thicknesses and s and ox are the dielectric constants of the semiconducting channel and the insulating oxide, respectively. To avoid short channel effects, CH has to be at least three times higher than the screening length, i.e. LCH > 3λSC [6]. The atomically thin semiconducting monolayers allows extreme scalability, as they offer s lower than 1 nm.
Among various semiconducting 2D materials, MoS2 has gained the most attention owing to its dominant n-type transport, stability, and ease of high-quality growth [7][8][9][10]. Hence, we have utilized MoS2 memtransistors for the implementation of BNN.
We have added more references to support our statement in the main manuscript.

The title of the article, "A Neural Network Accelerator to Avoid Inference Inaccuracy",
does not seem related to the paper results. The authors do not present a neural network accelerator, and the paper never really talks about avoiding inference inaccuracy. This was a little surprising.
In line with the reviewer's suggestion, we have changed the title to "A Bayesian Neural Network to Quantify Inference Inaccuracy". predominantly p-type with large work function contact metals such as Ni [7,[26][27][28].
For the implementation of the BNN, we design synapses that can draw random numbers from a Gaussian distribution. Fig. R4e shows the design of our GRNG-based synapse with independent control over its mean ( ) and standard deviation ( ), using two MoS2 memtransistors, T + and T − .
Here, the input to the synapse, in is applied as + in andin to T + and T − , respectively, as shown in Fig. R4e. The current at the output node, out is then given by sum of currents through T + and T − i.e., T + and T − , according to the KCL. To control eff and eff , T + is subjected to successive erase-program-read pulse cycles, while T − is programmed to a given state and subsequently only read, using the waveforms shown in Fig. R4f. This results in + being drawn from a Gaussian distribution, with + = 5 nS and + = 0.49 nS i.e., +~ and − having a constant value of ≈ We are happy to benchmark our devices with other alternative approaches. Table R1 shows the comparison in energy consumption between different memory technologies such as memristor, phase-change memory (PCM), NAND and our 2D memtransistors [29]. While the program/erase voltages and times for the transistor memory technologies are high, the energy consumption is similar as the programming current is much lower for the transistor technologies (≈10 -11 A) compared to memristors and PCM. Also note that, while our demonstration uses 2D memtransistors, it can also be implemented using commercial NAND technologies. We have included this table in the supplementary information.

I have concerns about the proposed crossbar architecture. I understand that for each
presented input, the crossbar devices need to be reprogrammed multiple times to provide a distribution at the output (once per sample). Having to reprogram the crossbar numerous times to perform a single inference seems an enormous energy cost. Due to this concern, the paper should include an energy analysis with some benchmarks, in my opinion.
We thank the reviewer for their recommendation. We have now included an energy analysis. The total energy consumption of the BNN accelerator is given by Eq. R7.
Here, Total represents the total energy consumption per test sample. syn , sense , and tanh represents the total energy consumption by the synapses, sense resistors, and the tanh circuit, respectively, during the inference step. S represents the current through the sense resistor and is the total number of devices. P and E represents the energy consumption for programming and erasing operation on these components. P and E respectively, are the gate currents associated with We have included the energy analysis main manuscript.

6.These multiple programming operations also raise the issue of device endurance, which
should be discussed. We have included the endurance characteristics in the main manuscript.

Also, if the synapses need to be reprogrammed each sample, we need additional memory arrays storing the mean value and standard deviations of each weight. This is an important
cost, and this would also limit the energy efficiency of the authors' approach, as important data movement will be involved.
We are happy to provide further clarification. We propose to use a scheme where all devices are erased with a high E of 13 V. Following that, each device must be programmed with the separate Figure R5. Endurance charachteristics of an MoS2 memtransistor for 2000 cycles. P values to set them to their expected states. Hence, we agree that we will need to store the P values to obtain the required mean and standard deviation.
We have included this clarification in the main manuscript. We are happy to include more Bayesian results. We have now implemented a BNN classifier to classify the PIMA Indian diabetes dataset. This dataset consists of nine parameters such as number of pregnancies, glucose levels, insulin levels, body mass index, and age. To classify this dataset, we use a fully connected 8×10×2 BNN i.e., it has an input layer with 8 neurons, one hidden layer with 10 neurons, and an output layer with 2 neurons. The dataset with 767 instances is divided into 720 for training and 47 for testing. Bayes by Backprop algorithm, with a Gaussian prior is used to train the synaptic weight distributions [13,14]. The BNN is trained off-chip for 300 epochs as

) Accuracy and predicative accuracy as function of model variation. Here, the effect of variation in synaptic devices, sense resistors, and activation function is demonstrated. c) Total entropy, aleatoric entropy and epistemic entropy as a function of model variation. d) Accuracy and predictive accuracy as a function of input variation. e) Total entropy, aleatoric entropy and epistemic entropy as a function of input variation.
shown in Fig. R6a, to obtain train accuracy of 75.41 % and test accuracy of 80.85 %. Similar accuracy numbers have been reported in prior works [15][16][17].
Following Eq. R8, the output of the BNN accelerator is sampled =100 times to obtain a predictive distribution, and its mean is used to make the classification.
Here, * and * are the test input and output, respectively, D is the training data, and represents the th Monte Carlo weight sample. The softmax of predictive distribution can be used to calculate the uncertainty in classification or entropy given by Eq. R9. [18,19].
Here, � * is the softmax output and is the number of output classes. The entropy can be decomposed into epistemic entropy i.e., uncertainty in model and aleatoric entropy i.e., uncertainty in data as shown in Eq. R10.
Here, on the right-hand side, the first term represents epistemic entropy and the second term represents aleatoric entropy. Aleatoric entropy is the average entropy for fixed weights and hence the uncertainty arises from the data. Epistemic entropy can be obtained by subtracting aleatoric entropy from total entropy, following Eq. R10.
The BNN accelerator to classify PIMA Indian diabetes dataset is evaluated using LTSpice simulations. The synapses consisting of 2 memtransistors, T + and T − are modelled using resistors.
Here, T + is modelled as a resistor which draws from Gaussian conductance distribution given by +~� + , + � and T − is modelled as a resistor set to a constant conductance of − . The sense transistor is modelled using a sense resistor. The neuron is implemented using the combination of an n-type FET and a p-type FET. Using the BNN accelerator, we are able replicate the test accuracy of 80.85 %. In order to evaluate the effect of device-to-device variation in memtransistors, we implement up to 10 % variation for the parameters of the synapse: + , + , and − , which results in a corresponding variation in , the parameters of the neuron: TH, and TH, , and for the ntype FET and the p-type FET, and the conductance of the sense resistor S . These parameters are drawn from Gaussian distributions where the mean is given by expected parameter value and standard deviation of up to 10 % is considered. Fig. R6b shows the effect of device-to-device (model) variation on the test accuracy and predictive accuracy. The predictive accuracy demonstrates how well the predictions of the expected correct classes are made. Here, the BNN is simulated and averaged over 5 runs. While, we observe a decrease in the test accuracy, it is not seen to significantly impact the operation of the BNN and an accuracy of ≈ 60 % is maintained for 10 % variation. In a BNN, we can use entropy estimation and entropy decomposition to quantify uncertainty and to find its source, respectively. Fig. R6c shows the total entropy, aleatoric entropy, and epistemic entropy as a function of model variation, calculated using Eq. R4 and Eq. R5. Here, we would expect the aleatoric uncertainty to remain unchanged and total entropy to increase.
However, their extraction is impacted by the increased model variation. Nevertheless, as expected the epistemic entropy increases as the model variation increases. As shown in Fig. R6d and Fig.   R6e, an increase in the input variation results in the degradation of accuracy, along with an increase in the total and aleatoric entropy, while a constant epistemic entropy is maintained, as expected.
Hence, in addition to the estimation of total entropy using a BNN, with uncertainty decomposition various sources of entropy/uncertainty can be identified.
We have added this discussion in the main manuscript.

Also, why are Bayesian neural networks useful for this example? I think that conventional neural networks get excellent accuracy on this task.
We thank the reviewer for their suggestion. As discussed in detail for the previous question, we have now implemented BNN classification on the PIMA Indian diabetes dataset, where we have a test accuracy of 80.85 %. This dataset has typically demonstrated similar accuracy numbers as evident from prior works [15][16][17], and hence serves better to demonstrate uncertainty estimation.
We have now implemented BNN to classify the PIMA Indian dataset and revised the main manuscript.

The degradation in accuracy between software and LTSpice simulation is quite severe, even for a very simple task, due to the neuron behavior. This is an important limitation. Can this problem be fixed?
We thank the reviewer for raising this question. As the reviewer rightly points out, when the tanh activation function is modelled using a resistor and an n-type FET, there is an asymmetric transfer function, which leads to a degradation in the accuracy. To avoid this, we have now implemented the tanh activation function using an n-type FET and a p-type FET, which enables us to replicate the test accuracy of 80.85 % in LTSpice simulations. For the tanh activation function, we demonstrate a circuit for a modified tanh (m-tanh) activation function using a n-type MoS2 memtransistors (T1) and a V-doped p-type WSe2 memtransistor (T2) as shown in Fig for S = 2 V, T1 operates in the on-state and becomes more conductive than T2, which results in O = DD . While this output characteristics is well-known [3,12], to the best of our knowledge it has not been used to implement activation functions, as the activation functions are typically implemented using look-up-tables [3]. Additionally, modified sigmoid activation function can be realized by applying 0 V to the drain terminal of T1, as shown in Fig. R7c and Fig. R7d. We have revised the implementation of the activation function and added this discussion in the main manuscript. Fig 5g with 0% variation shows a test accuracy that is practically 100%. However, in the text, the accuracy is said to be 93.78 %. This is a major concern.

11.
We are happy to provide further clarification. Here, the degradation of accuracy was a result of the asymmetry in the tanh activation function implemented using and n-type FET and a resistor. By implementing a neuron with a symmetric tanh activation function, using the integration of an ntype FET and a p-type FET enabled us to replicate the test accuracy of 80.85 % in LTSpice simulations.

Does Fig 5g include the effects of variability of the neuron devices? I understand that it
does not, and that would be a problem, as this variability might have worse effects than the one of synapses.
The reviewer raises an excellent point. Hence, we have now included the effect of neuron variation as well. In order to evaluate the effect of device-to-device variation in memtransistors, we implement up to 10 % variation for the parameters of the synapse: + , + , and − , which results in a corresponding variation in , the parameters of the neuron: TH, and TH, , and for the ntype FET and the p-type FET, and the conductance of the sense resistor S . These parameters are drawn from Gaussian distributions where the mean is given by expected parameter value and In error analysis, we have now included the effect neuron variation as well.
13. The caption of Figure 5 lacks details. The methods section should include the methods associated with Figure 5.
We have now included an extensive discussion in the Methods section to provide further details on Fig. 5.
I have reviewed the response and new manuscript, including the new tanh circuit. This version has addressed many of my previous concerns in their detailed response. However, I still would consider the following two issues with the manuscript, in determining if it is impactful enough to publish in Nature Communications.
The most significant issue with the paper is that there does not seem to be an advantage to using a MoS2 charge trapping device for this work. The same body of work could have been demonstrated using measured data from a SONOS, TANOS, or other modern, commercial device (perhaps a floating gate cell) -almost certainly with more consistent and stable electrical behavior due to maturity (including consistent random noise). I agree with the standard scaling argument for MoS2 transistors (as logic devices) that the authors presented in their response, but this scaling argument is not really relevant to how the device is being used in the paper. Hence, this MoS2 is not needed to achieve the functionality demonstrated in the paper. My general impression is that the paper is implying that this novel (but previously demonstrated) MoS2 device is enabling this new functionality -the paper must be clear that a new device is -not-required to achieve the presented functionality. Hence, it also follows that the novelty of the paper must be judged on the new circuit and application concepts, and whether these results constitute publication in Nature Comm.
In general, I did find the circuit and relevant mathematical analysis presented in Fig 5 and the relevant equations compelling and well thought out, albeit an unusual circuit. I would comment that one significant challenge of this configuration will be that it is very dependent on the programming accuracy of the resistance of the three devices at the end/bottom of each column, and programming error of these "neurons" will lead to significantly greater accuracy degradation than the same error would cause in the synapses. Put slightly differently, any noise or variation in those column end devices will essentially have equal footing to the sum of all the noise from synapses in the column.
Reviewer #2 (Remarks to the Author): The authors have very significantly improved the quality of the manuscript. I still have a few comments.
-The endurance of 2000 is in fact low, as each inference will consume many of these cycles. Also, the endurance experiment only validates that the ON and OFF states can be distinguished, not the fact the probabilistic synapse behavior is retained. This experiment is important, I think. The endurance issue should also be clearly highlighted as one of the major limitations of the approach.
-The title of the paper is still not adapted. All Bayesian neural networks quantify accuracy, and this particular aspect is not very developed in the manuscript. A more relevant title could be something like "Two-dimensional materials-based synapses for Bayesian neural network". We are happy to know that we have addressed your previous concerns. A point-by-point response to the comments raised by the reviewer can be found below. We are happy to provide further clarification. We included the following statements in the main manuscript to clarify that MoS2 memristors can be replaced with devices with charge-trap memory.
"Here, the random distributions are observed as an effect of random nature of charge trapping which is typically observed in charge-trap memory devices [1,2]. Hence, these memtransistors can also be replaced with standard three-terminal charge-trap flash memories such as TaN-Al2O3-Si3N4-SiO2-Si (TANOS) and Si-SiO2-Si3N4-SiO2-Si (SONOS) [3][4][5]." Note that MoS2 is being used here as it has emerged as a potential alternative to Si in recent years. Hence, MoS2 and other 2D materials have been used as channel material for devices with floating-gate to demonstrate non-volatile memory [6][7][8][9]. Hence, we have utilized MoS2 memtransistors for the implementation of BNN.
2. In general, I did find the circuit and relevant mathematical analysis presented in Fig 5 and the relevant equations compelling and well thought out, albeit an unusual circuit. I would comment that one significant challenge of this configuration will be that it is very dependent on the programming accuracy of the resistance of the three devices at the end/bottom of each column, and programming error of these "neurons" will lead to significantly greater accuracy degradation than the same error would cause in the synapses. Put slightly differently, any noise or variation in those column end devices will essentially have equal footing to the sum of all the noise from synapses in the column.
We are happy to know that the reviewer finds the circuit and mathematical analysis compelling and well thought out. We agree that evaluating the programming error introduced by the neuron must be analyzed. In order to evaluate the effect of device-to-device variation in memtransistors, In the revised manuscript, we have clarified that variation in synapses is more detrimental to the operation of the BNN. 3

Reviewer #2 (Remarks to the Author):
The authors have very significantly improved the quality of the manuscript. I still have a few comments.
We are happy to know that the reviewer finds that the manuscript has improved. A point-by-point response to the comments raised by the reviewer can be found below.
1. The endurance of 2000 is in fact low, as each inference will consume many of these cycles.  and Fig. R2c. The moving and are obtained across 100 samples at a time. Clearly the random number generation is stable, and the expected synaptic behaviors of increasing and as a function of DS and | P | is retained after numerous measurements. However, we agree with the reviewer that the endurance must be improved for MoS2 memtransistors and we have clarified this in the main text.

Also
We have included this discussion in the main manuscript and supplementary information.
2. The title of the paper is still not adapted. All Bayesian neural networks quantify accuracy, and this particular aspect is not very developed in the manuscript. A more relevant title could be something like "Two-dimensional materials-based synapses for Bayesian neural network".
We thank the reviewer for their suggestion. We have now changed the tile to "Two-dimensional materials-based synapses for Bayesian neural network".

The comparison with Ref 30 that the authors gave me in their answer should be included in the manuscript.
We have included this discussion in the manuscript.
I appreciate the authors work modeling the neural net error as a function of variation in the neuron and synapses and found that an important addition to the manuscript. However, in the model of neuron variability vs accuracy, I have found an inconsistency that I've been unable to reconcile. The finding of Fig R1b shows the sense conductance Gs (which should be specific to each column, as Gsj) can vary by 1000% with a negligible impact on accuracy -which is a very surprising finding and merits a detailed explanation. Mathematically, for this to be true, the Gs term in the denominator of the first summation term in eqn 4a must be negligible compared to the other term (sum) in the denominator. However, this contradicts the original purpose of Gsj correction -which was to correct the error associated with each column, stated as " sj( ) is used to make sure that each column of the crossbar array has the same ( ).". If Gsj, which represents the inaccuracy of each column, is negligible, then why is this compensation scheme even needed? The proposal of the compensation term implies that Gs is large enough to significantly affect the denominator of 4a. But the plot in R1b implies that Gs has a negligible affect on the accuracy. Given the strong effect of Vthn and Vthp on the output, a similar argument could be made for the very small Vt variation which was varied in fig R1b. Perhaps more details on how specifically the 10% variation was modeled would clarify this result. In any case, I strongly recommend checking and clarifying this before the article is publishedspecifically assert whether Gs term has a non-negligible effect on the accuracy and is needed for column compensation, or if Gs is negligible and does not need to be adjusted between columns.
Reviewer #2 (Remarks to the Author): The authors have addressed my last comments well. We are happy to know that the reviewer appreciates our response. We are happy to provide further clarification on the relevance of the sense transistor. Note that the primary purpose of the sense transistor is to perform a current-to-voltage conversion. Here, the current through a column (corresponding to the dot-product of input and weights) is converted to a proportional voltage given by Eq. R1.
Here, ( ) is the input voltage, ( ) is the conductance of synapse, and S ( ) is the conductance of the sense transistor for the th row, th column and th layer. The current-to-voltage conversion allows us to seamlessly integrate the circuit for m-tanh activation function into the crossbar array, as the input to the m-tanh activation function is a voltage. The secondary purpose of the sense transistor is to ensure that the nonideality in Eq. R1 i.e., the denominator term is consistent between different columns. Note, that for our implementation, the difference in ∑ ( ) =1 between different columns is not significant and hence the S ( ) correction in our case is small in magnitude.
Additionally, note that each synapse is represented by two devices instead one and hence their combined conductance leads to a much higher magnitude of ∑ is significantly different between columns, the impact error in S ( ) on the classification accuracy will be higher.
We believe that the impact of variation the tanh activation (implemented through variation in TH, and TH, ) is low due to the binary nature of the classification problem that we have chosen.
For a multiclass classification problem, the impact of the variation in activation function for is expected to be larger. In fact, in the initial version of the manuscript, we inspected a multiclass problem, where the impact of variation in the activation function was seen to be higher. To evaluate the effect of model variation, the parameters are drawn from Gaussian distributions where the mean is given by expected parameter value and standard deviation of up to 10 % is considered. We thank the reviewer for this suggestion. We have made this clarification in the main manuscript.